A Conversational Paradigm for Multimodal Human Interaction
Abstract
We present an alternative to the manipulative and semaphoric gesture recognition paradigms. Human multimodal communicative behaviors form a tightly integrated whole. We present a paradigm for multimodal analysis of natural discourse, based on a feature-decompositive, psycholinguistically derived model that permits us to access the underlying structure and intent of multimodal communicative discourse. We outline the psycholinguistics that drive our paradigm and the Catchment concept that gives us a computational handle on discourse entities, and summarize some approaches and results that realize this vision. We show examples of such discourse-structuring features as handedness, types of symmetry, gaze-at-interlocutor, and hand ‘origos’. Such analysis is an alternative to the ‘recognition of one discrete gesture out of k stylized whole gesture models’ paradigm.

This research has been partially supported by the U.S. National Science Foundation STIMULATE program, Grant No. IRI-9618887, “Gesture, Speech, and Gaze in Discourse Segmentation”, and the National Science Foundation KDI program, Grant No. BCS-9980054, “Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze Research”. Much of the work reported here is the collaborative effort of our research team, chief among whom is David McNeill of the University of Chicago.

1. OF MANIPULATION AND SEMAPHORES

The bulk of research in the instrumental comprehension of human gestures clusters around two kinds of gestures: manipulative and semaphoric. We define manipulative gestures as those whose intended purpose is to control some entity through a tight coupling between the actual movements of the gesturing hand/arm and the entity being manipulated. Semaphores are systems of signalling with flags, lights, or arms. By extension, we define semaphoric gestures to be any gesturing system that employs ‘whole gestures’ [1] or stylized dictionaries of static or dynamic hand or arm gestures.

Research employing the manipulative gesture paradigm may be thought of as following the seminal “Put-That-There” work by Richard Bolt [2, 3]. Since then, there has been a plethora of systems that implement finger tracking/pointing, a variety of ‘finger-flying’ navigation in virtual spaces or direct-manipulation interfaces, control of appliances, computer games, and robot control. In a sense the hand is the ultimate ‘multi-purpose’ tool, and manipulation properly represents a large proportion of human hand use. We have observed, however, that gestures used in communication/conversation differ from manipulative gestures in several significant ways [4, 5]. First, because the intent of the latter is manipulation, there is no guarantee that the salient features of the hands are visible. Second, the dynamics of hand movement in manipulative gestures differ significantly from those of conversational gestures. Third, manipulative gestures are typically aided by visual, tactile or force feedback from the object (virtual or real) being manipulated, while conversational gestures are performed without such constraints. Gesture and manipulation are clearly different entities, sharing between them possibly only the feature that both may utilize the same bodily parts.

Semaphoric gestures are typified by the application of some recognition-based approach to identify some gesture g_i ∈ G, where G is a set of predefined gestures.
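To make the semaphoric paradigm concrete, the sketch below is a toy illustration of our own (not drawn from any of the cited systems): each gesture in a fixed vocabulary G is represented by a stylized feature template, and an observed feature vector is labeled with the nearest template g_i. The labels and feature vectors are hypothetical.

import numpy as np

# Hypothetical stylized templates: one feature vector per predefined gesture in G.
GESTURE_TEMPLATES = {
    "stop":    np.array([1.0, 0.0, 0.0]),
    "forward": np.array([0.0, 1.0, 0.0]),
    "back":    np.array([0.0, 0.0, 1.0]),
}

def classify_semaphore(features):
    """Return the label of the predefined gesture g_i closest to the observation."""
    return min(GESTURE_TEMPLATES,
               key=lambda g: np.linalg.norm(features - GESTURE_TEMPLATES[g]))

# Example: a feature vector near the "stop" template is labeled "stop".
print(classify_semaphore(np.array([0.9, 0.1, 0.0])))

The point of the sketch is only that recognition in this paradigm reduces to selecting one element of a closed, predefined set.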
Semaphoric approaches may be termed ‘communicative’ in that gestures serve as a universe of symbols to be communicated to the machine. A pragmatic distinction between semaphoric and manipulative gestures is that semaphores typically do not require the feedback control (e.g. hand-eye, force-feedback, or haptic) that manipulation necessitates. Systems operating under this paradigm typically define a set of stylized gesture and head movement ‘symbols’ that are then recognized by a variety of techniques, including graph labeling [6], Principal Components Analysis [7], Hidden Markov Models [8, 9, 10] and Neural Networks [10, 11]. Unfortunately, such semaphoric hand-use is a minuscule percentage of typical hand-use in communication.

Both manipulative and semaphoric gesture models suffer significant shortcomings. While manipulation represents a significant proportion of natural human hand use, natural manipulation situations almost always involve handling the artifact being manipulated, which provides tactile and force feedback. Free-hand manipulation interfaces, on the other hand, lack such feedback and rely almost exclusively on visual feedback. Semaphores represent a minuscule portion of the use of the hands in natural human communication.

In reviewing the challenges to automatic gesture recognition, Wexelblat [1] emphasizes the need for development of systems able to recognize natural, non-posed and non-discrete gestures. Wexelblat dismisses systems recognizing artificial, posed and discrete gestures as unnecessary and superficial:

“If users must make one fixed gesture to, for example, move forward in a system then stop, then make another gesture to move backward, I find myself wondering why the system designers bother with gesture in the first place. Why not simply give the person keys to press: one for forward and one for backward?”

He considers natural gestural interaction to be the only “real” and useful mode of interfacing with computer systems:

“... one of the major points of gesture modes of operation is their naturalness. If you take away that advantage, it is hard to see why the user benefits from a gestural interface at all.”

He underscores the need for systems working with truly conversational gestures, and also emphasizes the tight connection between gestures and speech (conversational gestures cannot be analyzed without considering speech). He expresses an urgent need for standard datasets that could be used for testing gesture recognition algorithms. One of his conclusions, however, is that the need for conversational gesture recognition still remains to be proven (by demonstrating, for example, that natural gesture recognition can improve speech recognition):

“An even broader challenge in multimodal interaction is the question of whether or not gesture serves any measurable useful function, particularly in the presence of speech.”

In their review of gesture recognition systems, Pavlović, Sharma and Huang [12] conclude that natural, conversational gesture interfaces are still in their infancy. They state that most current work “address[es] a very narrow group of applications: mostly symbolic commands based on hand postures or 3D-mouse type of pointing”, and that “real-time interaction based on 3D hand model-based gesture analysis is yet to be demonstrated”.

2. A NATURAL GESTICULATION PARADIGM

Natural human communication is inherently multimodal.
One’s interlocutor utilizes nuances of gaze awareness, hand gestural timings, voice prosody, and hand and eye deixis to assist in understanding the cotemporal spoken discourse. If we are to build systems able to exploit such behavioral activity in natural interaction, it is essential to derive computationally accessible metrics that can inform systems as to the discourse-level organization of the underlying communication. In this paper, we present a paradigm based on a feature-decompositive, psycholinguistically derived model that permits us to access the underlying structure and intent of multimodal communicative discourse. We shall discuss the psycholinguistic grounding for this work, introduce the concept of the ‘Catchment’ that bridges the ‘psycholinguistic black box’ and instrumentally computable entities, and present several examples of decomposed features that facilitate discourse structuring. We shall present the psycholinguistic basis of our approach, our experimental methods, and some concrete examples of how this paradigm facilitates discourse segmentation.

3. PSYCHOLINGUISTIC BASIS

In natural conversation between humans, gesture and speech function together as a ‘co-expressive’ whole, providing one’s interlocutor access to the semantic content of the speech act. Psycholinguistic evidence has established the complementary nature of the verbal and non-verbal aspects of human expression [13]. Gesture and speech are not subservient to each other, as though one were an afterthought to enrich or augment the other. Instead, they proceed together from the same ‘idea units’, and at some point bifurcate to the different motor systems that control movement and speech.

Consider an example in which a speaker says “when you enter the room” while performing a two-handed mirror-symmetric gesture: her hands begin in front of her, palms facing her torso, move outward in a sweeping action, and terminate to the right and left of her torso, palms facing out. The speech alone indicates the act of entering, while the gesture indicates that the doors are normally closed and that there are double doors. Since human communicative modalities spring from the same semantic source, these modalities cohere topically at a level beyond the local syntax structure. This multimodal structuring occurs at an unwitting, albeit not unintended, level of consciousness. The speaker is actively formulating the discourse content and responding to her interlocutor. One might think of such multimodal utterances as proceeding from a nascent idea unit in the speaker’s mind known as a growth point [14, 15]. This stream of ‘idea units’ moves through the brain and is unpacked into co-expressive and co-temporal speech and gestural activity. Just as we are unwitting, in natural speech, as to how we form sentences from ideas, we are equally unwitting as to how we employ space and time naturally in gesture (and other head, body, and gaze behavior) at the moment of utterance.

[Figure 1: GSG Experiments Block Diagram. Components: multimodal elicitation experiment; single-camera video and audio capture; calibrated 5-camera video and digital audio capture; processing (video extraction, hand tracking, gaze tracking, audio feature detection); detailed speech transcription; transcript-only Grosz-style analysis; video and transcript psycholinguistic analysis; hypothesized cue extraction; correspondence analysis; new observational discovery.]
Nonetheless, there is intelligible organization in the gesticulation, just as there is intelligible organization in the speech. The challenge is to decode this organization. Before we proceed, we shall introduce a psycholinguistic device called a catchment that serves as the basis of our computational model. The concept of a catchment associates various discourse components; it is a unifying concept [16, 17]. A catchment is recognized when gesture features recur in two or more (not necessarily consecutive) gestures. The logic is that the recurrence of imagery in a speaker’s thinking will generate recurrent gesture features. Recurrent images suggest a common discourse theme. These gesture features can be detected, and the recurring features offer clues to the cohesive linkages in the text with which they co-occur. A catchment is a kind of thread of visuospatial imagery that runs through the discourse to reveal emergent larger discourse units, even when the parts of the catchment are separated in time by other thematic material. By discovering the catchments created by a given speaker, we can see what this speaker is combining into larger discourse units: what meanings are regarded as similar or related and grouped together, and what meanings are being put into different catchments or are being isolated, and thus seen by the speaker as having distinct or less related meanings. By examining interactively shared catchments, we can extend this thematic mapping to the social framework of the discourse.

4. EXPERIMENTAL APPROACH

Figure 1 shows our general experimental approach. We perform an elicitation experiment in which human subjects perform a communicative task that is conducive to the performance of certain multimodal behavior. This experiment is captured on video and audio, and the data are analyzed. We compare the computed multimodal features against a set of carefully manually coded discourse analyses to test the correlation of various multimodal features with discourse phenomena observed in the manual coding.

Elicitation Experiments: We employ two sets of elicitations. In the first, a subject describes her home or living space. We call this our ‘living space’ elicitation. In the second, we recruited pairs of subjects to serve as speaker-interlocutor pairs. This avoids ‘stranger-experimenter’ inhibition in the captured discourse, since the subjects already know one another. The subject is shown a model of a village and told that a family of intelligent wombats has taken over the town theater. She is made privy to a plan to surround and capture the wombats and send them back to Australia. This plan involves collaborators among the villagers, paths of approach, and encircling strategies. The subject communicates these plans to her interlocutor using the town model, and is videotaped throughout the discourse. We call this our ‘wombat’ experiment.

In our earlier experiments, we employed a single camera viewing the subject [18]. That data is thus monocular and 2D in nature. In our current experiments, we apply a three-camera setup. Two of the cameras are calibrated so that once point correspondences between the two views are established, 3D positions and velocities can be obtained. The third camera provides a closeup of the head. We chose this configuration because our experimental setup must be portable and easy to set up (some of our cross-disciplinary collaborators collect data in the field).
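The paper does not specify an implementation of the 3D recovery, but as a rough sketch under stated assumptions, 3D hand positions can be obtained from corresponding image points in the two calibrated cameras as follows. The 3x4 projection matrices P1 and P2 are assumed to come from the calibration step described in the next section; OpenCV’s triangulatePoints is used purely for illustration and is not named in the paper.

import numpy as np
import cv2  # illustrative tool choice, an assumption on our part

def triangulate_hand(P1, P2, pt_cam1, pt_cam2):
    """Recover a 3D hand position from one corresponding point pair.

    P1, P2           : 3x4 projection matrices of the two calibrated cameras
    pt_cam1, pt_cam2 : (x, y) pixel coordinates of the same hand in each view
    """
    a = np.asarray(pt_cam1, dtype=float).reshape(2, 1)
    b = np.asarray(pt_cam2, dtype=float).reshape(2, 1)
    X_h = cv2.triangulatePoints(P1, P2, a, b)   # homogeneous 4x1 result
    return (X_h[:3] / X_h[3]).ravel()           # Euclidean (x, y, z)

# Velocities for the motion traces then follow by finite differences over
# frames, e.g. v[t] = (x[t+1] - x[t]) / dt.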
We use a standard stereo calibration technique due to Tsai [19] for camera calibration. This algorithm requires a calibration frame of points whose absolute 3D positions are known in some coordinate system. The algorithm takes into consideration various factors such as radial lens distortion. (Our experimental setup and equipment are described at http://vislab.cs.wright.edu/KDI/.)

4.1. Extraction of 3D Hand Motion Traces

We apply a parallelizable fuzzy image processing approach known as Vector Coherence Mapping (VCM) [20, 21, 22, 23] to track hand motion. VCM applies spatial coherence, momentum (temporal coherence), speed limit, and skin color constraints in the vector field computation by using a fuzzy-combination strategy, and produces good results for hand gesture tracking. We apply an iterative clustering algorithm that minimizes spatial and temporal vector variance to extract the moving hands [4, 5, 22, 23]. The positions of the hands in the stereo images are used to produce 3D motion traces describing the gestures.

4.2. Detailed Discourse Analysis

We perform a linguistic text transcription of the discourse by hand. This transcription is very detailed, including the presence of breath pauses and other pauses, disfluencies, and interactions between the speakers. Barbara Grosz and colleagues [24] have devised a systematic procedure for recovering the discourse structure from a transcribed text. The method consists of a set of questions with which to guide analysis and uncover the speaker’s goals in producing each successive line of text. The result is a carefully constructed purpose hierarchy that segments the discourse in terms of ‘purpose units’. We also analyze the speech data using the Praat phonetics analysis tool [25] to time-tag the beginning of every word in the utterance, and to obtain the time indices of the start and end of every unit in the purpose hierarchy. This gives us a set of time indices at which semantic breaks are expected according to the Grosz analysis.

4.3. Integrative Analysis

Finally, we use our Visualization for Situated Temporal Analysis (VisSTA) system [26] to integrate the various data sources. This system permits time-synchronous analysis of video and audio while viewing animated graphs of the extracted signals, in conjunction with an animated text transcript display, providing simultaneous random access to signal, text, and video.
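As an illustration of the kind of comparison this integration supports, the following sketch checks how often computed gesture-feature change points (for example, changes in handedness or symmetry) fall near the time indices of Grosz purpose-unit boundaries. The function name, data structures, and the 0.5 s tolerance are our own assumptions for illustration; this is not part of the VisSTA interface.

def boundary_agreement(feature_breaks, purpose_boundaries, tol=0.5):
    """Fraction of purpose-unit boundaries (in seconds) that have a computed
    gesture-feature change point within +/- tol seconds."""
    if not purpose_boundaries:
        return 0.0
    hits = sum(
        any(abs(b - f) <= tol for f in feature_breaks)
        for b in purpose_boundaries
    )
    return hits / len(purpose_boundaries)

# Example with hypothetical timestamps (seconds):
# boundary_agreement([3.2, 7.9, 15.1], [3.0, 8.0, 12.5])  ->  2/3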